BANNER-CHEMDNER: Incorporating Domain Knowledge in Chemical and Drug Named Entity Recognition
Abstract
Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining, driven by the recent growth in the amount of biomedical literature. Named entity recognition is an essential prerequisite for effective text mining of biomedical literature. Participants in the CHEMDNER task of the BioCreative IV challenge are asked to develop a system that recognizes mentions of chemical compounds and drugs; the organizers provide a set of annotated PubMed documents for training and a set of selected PubMed documents for evaluation. Our primary goal is to develop a named entity recognition system that scales well over millions of documents and can easily be plugged into a biomedical text mining system, while exploiting unlabeled data to improve performance. We extracted Brown cluster labels and word embeddings, both induced from a large unlabeled document corpus, as word representation features, in addition to word, word and character n-gram, and traditional orthographic features. During training, a second-order CRF model was built with varied feature spaces, and the influence of the different feature groups defining the experimental baselines was observed in terms of F-score, recall, precision, and processing time. Our system achieves F-scores of 81.41% and 82.58% on the CHEMDNER development set for the CEM and CDI sub-tasks, respectively, while processing ~530 documents per minute.
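The feature groups named in the abstract (word, character n-gram, orthographic, and Brown cluster features) can be illustrated with a minimal Python sketch. It is not the BANNER-CHEMDNER implementation: the function names, the toy `BROWN_CLUSTERS` table, the prefix lengths, and the n-gram sizes are all assumptions chosen for illustration, and word embeddings and the CRF itself are left out.

```python
import re

# Hypothetical toy Brown-cluster table (token -> bit-string path).
# In practice such paths are induced from a large unlabeled corpus.
BROWN_CLUSTERS = {"aspirin": "110100", "ibuprofen": "110101", "the": "0010"}


def char_ngrams(token, n_min=2, n_max=3):
    """Character n-grams of the token, with ^ and $ marking its boundaries."""
    padded = f"^{token}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def word_shape(token):
    """Orthographic shape: uppercase -> A, lowercase -> a, digit -> 0."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def token_features(tokens, i, brown=BROWN_CLUSTERS, prefix_lengths=(4, 6)):
    """Feature dict for tokens[i], in the style accepted by common CRF toolkits."""
    token = tokens[i]
    feats = {
        "word": token.lower(),
        "shape": word_shape(token),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "is_capitalized": token[:1].isupper(),
        # Neighbouring words give the sequence model local context.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }
    for gram in char_ngrams(token.lower()):
        feats[f"ngram={gram}"] = True
    # Prefixes of the Brown path act as cluster labels at coarser granularity.
    path = brown.get(token.lower())
    if path is not None:
        for p in prefix_lengths:
            feats[f"brown_{p}"] = path[:p]
    return feats


if __name__ == "__main__":
    print(token_features(["Take", "aspirin", "daily"], 1))
```

Because the cluster features come from unlabeled text, a token that never appears in the annotated training set can still share a Brown prefix with known chemical names, which is one way unlabeled data can help a supervised tagger.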
Similar resources
Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations
BACKGROUND Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning m...
Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination
Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...
Comparison of different strategies for utilizing two CHEMDNER corpora
To identify chemical entities and drug names in patents for the CEMP subtask of the CHEMDNER patent task, we use a machine learning technique to construct a chemical named entity recognition (CNER) system. It is desirable for a machine-based CNER system to have a large number of training examples. Two CHEMDNER corpora have been developed. One is the corpus for the patent task and the other is the CHEMDNER corpus ...
CHEMDNER system with mixed conditional random fields and multi-scale word clustering
BACKGROUND The chemical compound and drug name recognition plays an important role in chemical text mining, and it is the basis for automatic relation extraction and event identification in chemical information processing. So a high-performance named entity recognition system for chemical compound and drug names is necessary. METHODS We developed a CHEMDNER system based on mixed conditional r...
Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization
BACKGROUND The functions of chemical compounds and drugs that affect biological processes and their particular effect on the onset and treatment of diseases have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literatures on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound a...